Abstract

We consider the problem of PAC-learning from distributed data and analyze fundamental commu- 
nication complexity questions involved. We provide general upper and lower bounds on the amount of 
communication needed to learn well, showing that in addition to VC-dimension and covering number, 
quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also 
present tight results for a number of common concept classes including conjunctions, parity functions, 
and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a 
version of the Perceptron algorithm to learn with much less communication than the number of updates 
given by the usual margin bound. We also show how boosting can be performed in a generic manner 
in the distributed setting to achieve communication with only logarithmic dependence on $1/\epsilon$ for any
concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can 
be used to achieve low communication in agnostic settings as well. We additionally present an analysis 
of privacy, considering both differential privacy and a notion of distributional privacy that is especially 
appealing in this context. 

1 Introduction 

Suppose you have two databases: one with the positive examples and another with the negative examples. 
How much communication between them is needed to learn a good hypothesis? In this paper we consider 
this question and its generalizations, as well as related issues such as privacy. Broadly, we consider a 
framework where information is distributed between several locations, and our goal is to learn a low-error 
hypothesis with respect to the overall distribution of data using as little communication, and as few rounds 
of communication, as possible. Motivating examples include: 

1. Suppose k research groups around the world have collected large scientific datasets, such as genomic 
sequence data or sky survey data, and we wish to perform learning over the union of all these different 
datasets without too much communication.

2. Suppose we are a sociologist and want to understand what distinguishes the clientele of two
retailers (Macy's vs Walmart). Each retailer has a large database of its own customers and we want to
learn a classification rule that distinguishes them. This is an instance of the case of each database 
corresponding to a different label. It also brings up natural privacy issues. 

3. Suppose k hospitals with different distributions of patients want to learn a classifier to identify a 
common misdiagnosis. Here, in addition to the goal of achieving high accuracy, low communication, 
and privacy for patients, the hospitals may want to protect their own privacy in some formal way as 
well. 

We note that we are interested in learning a single hypothesis h that performs well overall, rather than 
separate hypotheses $h_i$ for each database. For instance, in the case that one database has all the positive
examples and another has all the negatives, the latter problem becomes trivial. More generally, we are 
interested in understanding the fundamental communication complexity questions involved in distributed 
learning, a topic that is becoming increasingly relevant to modern learning problems. These issues, more- 
over, appear to be quite interesting even for the case of k = 2 entities. 

1.1 Our Contributions 

We consider and analyze fundamental communication questions in PAC-learning from distributed data, pro- 
viding general upper and lower bounds on the amount of communication needed to learn a given class, as 
well as broadly-applicable techniques for achieving communication-efficient learning. We also analyze a 
number of important specific classes, giving efficient learning algorithms with especially good communi- 
cation performance, as well as in some cases counterintuitive distinctions between proper and non-proper 
learning. 

Our general upper and lower bounds show that in addition to VC-dimension and covering number,
quantities such as the teaching-dimension and mistake-bound of a class play an important role in
determining communication requirements. We also show how boosting can be performed in a
communication-efficient manner, achieving communication depending only logarithmically on
$1/\epsilon$ for any class, along with tradeoffs between total communication and number of
communication rounds. Further, we show that, ignoring computation, agnostic learning can be performed
to error $O(\mathrm{opt}(\mathcal{H})) + \epsilon$ with logarithmic dependence on $1/\epsilon$, by
adapting results of Balcan and Hanneke [2012].

In terms of specific classes, we present several tight bounds including a $\Theta(d \log d)$ bound on the
communication in bits needed for learning the class of decision lists over $\{0,1\}^d$. For learning linear
separators, we show that for non-concentrated distributions, we can use a version of the Perceptron
algorithm to learn using only $O(\sqrt{d \log(d/\epsilon)}/\epsilon^2)$ rounds of communication, each round
sending only a single hypothesis vector, much less than the $O(d/\epsilon^2)$ total number of updates
performed by the Perceptron algorithm. For parity functions, we give a rather surprising result. For the
case of two entities, while proper learning has an $\Omega(d^2)$ lower bound based on classic results in
communication complexity, we show that non-proper learning can be done efficiently using only $O(d)$ bits
of communication. This is a by-product of a general result regarding concepts learnable in the
reliable-useful framework of Rivest and Sloan [1988]. For a table of results, see Appendix A.

We additionally present an analysis of communication-efficient privacy-preserving learning algorithms,
considering both differential privacy and a notion of distributional privacy that is especially appealing
in this context. We show that in many cases we can achieve privacy without incurring any additional
communication penalty.

More broadly, in this work we propose and study communication as a fundamental resource for
PAC-learning in addition to the usual measures of time and samples. We remark that all our algorithms
for specific classes address communication while maintaining efficiency along the other two axes.

1.2 Related Work 

Related work in computational learning theory has mainly focused on the topic of learning and parallel 
computation. Bshouty [1997] shows that many simple classes that can be PAC-learned cannot be efficiently
learned in parallel with a polynomial number of processors. Long and Servedio [2011] show a parallel
algorithm for large-margin classifiers running in time $O(1/\gamma)$, compared to more naive
implementations costing $\Omega(1/\gamma^2)$, where $\gamma$ is the margin. They also show an
impossibility result regarding boosting, namely that
the ability to call the weak learner oracle multiple times in parallel within a single boosting stage does not 
reduce the overall number of successive stages of boosting that are required. Collins et al. [2002] give an 
online algorithm that uses a parallel-update method for the logistic loss, and Zinkevich et al. [2010] give a 
detailed analysis of a parallel stochastic gradient descent in which each machine processes a random subset 
of the overall data, combining hypotheses at the very end. All of the above results are mainly interested in 
reducing the time required to perform learning when data can be randomly or algorithmically partitioned 
among processors; in contrast, our focus is on a setting in which we begin with data arbitrarily partitioned 
among the entities. Dekel et al. [2011] consider distributed online prediction with arbitrary partitioning of 
data streams, achieving strong regret bounds; however, in their setting the goal of entities is to perform well 
on their own sequence of data. 

In very recent independent work, Daume III et al. [2012a] examine a setting much like that considered 
here, in which parties each have an arbitrary partition of an overall dataset, and the goal is to achieve low 
error over the entire distribution. They present communication-efficient learning algorithms for axis-parallel
boxes as well as for learning linear separators in $\mathbb{R}^2$. Daume III et al. [2012b], also independently of our
work, extend this to the case of linear separators in $\mathbb{R}^d$, achieving bounds similar to those obtained via our
distributed boosting results. Additionally, they consider a range of distributed optimization problems, give 
connections to streaming algorithms, and present a number of experimental results. Their work overall is 
largely complementary to ours. 

2 Model and Objectives 

Our model can be viewed as a distributed version of the PAC model. We have $k$ entities (also called
"players") denoted by $K$ and an instance space $X$. For each entity $i \in K$ there is a distribution
$D_i$ over $X$ that entity $i$ can sample from. These samples are labeled by an unknown target function
$f$. Our goal is to find a hypothesis $h$ which approximates $f$ well on the joint mixture
$D(x) = \frac{1}{k} \sum_{i=1}^{k} D_i(x)$. In the realizable case, we are given a concept class
$\mathcal{H}$ such that $f \in \mathcal{H}$; in the agnostic case, our goal is to perform nearly as
well as the best $h' \in \mathcal{H}$.

In order to achieve our goal of approximating / well with respect to D, entities can communicate with 
each other, for example by sending examples or hypotheses. At the end of the process, each entity should 
have a hypothesis of low error over D. In the center version of the model there is also a center, with initially 
no data of its own, mediating all the interactions. In this case the goal is for the center to obtain a low 
error hypothesis h. In the no-center version, the players simply communicate directly. In most cases, the 
two models are essentially equivalent; however (as seen in Section 5), the case of parity functions forms a 
notable exception. We assume the $D_i$ are not known to the center or to any entity $j \neq i$ (in fact, $D_i$ is not
known explicitly even by entity $i$, and can be approximated only via sampling). Finally, let $d$ denote the VC
dimension of $\mathcal{H}$, and $\epsilon$ denote our target error rate in the realizable case, or our target gap with respect to
$\mathrm{opt}(\mathcal{H})$ in the agnostic case.¹ We will typically think of $k$ as much smaller than $d$.

Remark: We are assuming all players have the same weight, but all results extend to players with 
different given weights. We also remark that except for our generic results, all our algorithms for specific 
classes will be computationally efficient (see Appendix A). 

Communication Complexity 

Our main focus is on learning methods that minimize the communication needed in order to learn well.
There are two critical parameters: the total communication (in terms of bits, examples, or hypotheses
transmitted) and latency (the number of rounds required). Also, in comparison to the baseline
algorithm of having each database send all (or a random sample of) its data to a center, we will be looking 
both at methods that improve over the dependence on e and that improve over the dependence on d in 
terms of the amount of communication needed (and in some cases we will be able to improve in both 
parameters). In both cases, we will be interested in the tradeoffs between total communication and the 
number of communication rounds. The interested reader is referred to Kushilevitz and Nisan [1997] for an 
excellent exposition of communication complexity. 

When defining the exact communication model, it is important to distinguish whether entities can learn
information from not receiving any data. For the most part we assume an asynchronous communication
model, where the entities cannot deduce any information when they do not receive data (and there
is no assumption about the delay of a message). In a few places we use a much stronger model of
lock-synchronous communication, where the communication is in time slots (so one can deduce that no one
sent a message in a certain time slot) and if multiple entities try to transmit at the same time only one
succeeds. Note that if we have an algorithm with $T$ time steps and $C$ communication bits in the
lock-synchronous model, using an exponential back-off mechanism [Herlihy and Shavit, 2008] and a
synchronizer [Peleg, 2000], we can convert it to an asynchronous protocol with $O(T \log k)$ rounds and
$O((T + C) \log k)$ communication bits.

Privacy 

In addition to minimizing communication, it is also natural in this setting to consider issues of privacy, which 
we examine in Section 10. In particular, we will consider privacy of three forms: differential privacy for the 
examples (the standard form of privacy considered in the literature), differential privacy for the databases
(viewing each entity as an individual deserving of privacy, which requires k to be large for any interesting 
statements), and distributional privacy for the databases (a weaker form of privacy that we can achieve even 
for small values of k). See Dwork [2008] for an excellent survey of differential privacy. 

3 Baseline approaches and lower bounds 

We now describe two baseline methods for distributed learning as well as present general lower bounds. 

Supervised Learning: The simplest baseline approach is to just have each database send a random sample of
size $O(\frac{1}{k}(\frac{d}{\epsilon} \log \frac{1}{\epsilon}))$ to the center, which then performs the
learning. This implies a total communication cost of $O(\frac{d}{\epsilon} \log \frac{1}{\epsilon})$ in
terms of number of examples transmitted. Note that while the sample received by the center is not precisely
drawn from $D$ (in particular, it contains the same number of points from each $D_i$), the standard
double-sample VC-dimension argument still applies, and so with high probability all consistent
$h \in \mathcal{H}$ have low error. Similarly, for the agnostic case it suffices to use a total of
$O(\frac{d}{\epsilon^2} \log \frac{1}{\epsilon})$ examples. In both cases, there is just one round of
communication.

¹ We will suppress dependence on the confidence parameter $\delta$ except in cases where it behaves in a nontrivial manner.

EQ/online algorithms: A second baseline method is to run an Equivalence Query or online Mistake-Bound
algorithm at the center. This method is simpler to describe in the lock-synchronous model. In each round
the center broadcasts its current hypothesis. If any of the entities has a counterexample, it sends the
counterexample to the center. If not, then we are done. The total amount of communication measured in
terms of examples and hypotheses transmitted is at most the mistake bound $M$ of the algorithm for learning
$\mathcal{H}$; in fact, by having each entity run a shadow copy of the algorithm, one needs only to
transmit the examples and not the hypotheses. Note that in comparison to the previous baseline, there is
now no dependence on $\epsilon$ in terms of communication needed; however, the number of rounds may now be
as large as the mistake bound $M$ for the class $\mathcal{H}$. Summarizing,

Theorem 1. Any class $\mathcal{H}$ can be learned to error $\epsilon$ in the realizable case using 1 round and
$O(\frac{d}{\epsilon} \log \frac{1}{\epsilon})$ total examples communicated, or $M$ rounds and $M$ total
examples communicated, where $M$ is the optimal mistake bound for $\mathcal{H}$. In the agnostic case, we
can learn to error $\mathrm{opt}(\mathcal{H}) + \epsilon$ using 1 round and
$O(\frac{d}{\epsilon^2} \log \frac{1}{\epsilon})$ total examples communicated.
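
As an illustration of the Equivalence Query baseline, the following Python sketch simulates the center-side loop for one concrete class. It is a minimal sketch under stated assumptions, not the paper's implementation: the class is taken to be monotone conjunctions over $\{0,1\}^d$ learned by the standard elimination rule, the data is assumed realizable, and the names `eq_protocol`, `find_counterexample`, and `predict` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Example = Tuple[Tuple[int, ...], int]  # (bit vector, label in {0, 1})


def predict(hyp: Set[int], x: Tuple[int, ...]) -> int:
    """Monotone conjunction: positive iff every variable in hyp is set in x."""
    return int(all(x[j] == 1 for j in hyp))


def find_counterexample(hyp: Set[int], data: List[Example]) -> Optional[Example]:
    """Return a local example that the broadcast hypothesis misclassifies, if any."""
    for x, y in data:
        if predict(hyp, x) != y:
            return (x, y)
    return None


def eq_protocol(players: List[List[Example]], d: int):
    """Center-side loop. Returns (hypothesis, rounds, examples sent)."""
    hyp = set(range(d))  # most specific hypothesis: AND of all d variables
    rounds = examples_sent = 0
    while True:
        rounds += 1  # one broadcast of the current hypothesis
        cex = None
        for data in players:  # lock-synchronous model: one sender succeeds
            cex = find_counterexample(hyp, data)
            if cex is not None:
                break
        if cex is None:
            return hyp, rounds, examples_sent
        x, _ = cex
        examples_sent += 1
        # Elimination update: drop every variable that is 0 in the example.
        new_hyp = {j for j in hyp if x[j] == 1}
        if new_hyp == hyp:  # only possible if the data is not realizable
            raise ValueError("data is not consistent with a monotone conjunction")
        hyp = new_hyp
```

Each counterexample removes at least one variable, so at most $d$ examples are ever sent in this instance, independently of $\epsilon$; the price is that the number of rounds can be as large as the mistake bound.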

Another baseline approach is for each player to describe an approximation to the joint distribution
induced by $D_i$ and $f$ to the center, in cases where that can be done efficiently. See Appendix B.1 for an
example.

We now present a general lower bound on communication complexity for learning a class $\mathcal{H}$. Let
$N_{\epsilon,D}(\mathcal{H})$ denote the size of the minimum $\epsilon$-cover of $\mathcal{H}$ with respect to $D$, and let
$N_\epsilon(\mathcal{H}) = \sup_D N_{\epsilon,D}(\mathcal{H})$. Let $d_T(\mathcal{H})$ denote the teaching dimension of class $\mathcal{H}$.²

Theorem 2. Any class $\mathcal{H}$ requires $\Omega(\log N_{2\epsilon}(\mathcal{H}))$ bits of communication to learn to error $\epsilon$. This implies
$\Omega(d)$ bits are required to learn to error $\epsilon < 1/8$. For proper learning, $\Omega(\log |\mathcal{H}|)$ bits are required to learn
to error $\epsilon < \frac{1}{2 d_T(\mathcal{H})}$. These hold even for $k = 2$.

Proof. Consider a distribution $D_1$ such that $N = N_{2\epsilon,D_1}(\mathcal{H})$ is maximized. Let $D_2$ be concentrated on a
single (arbitrary) point $x$. In order for player 2 to produce a hypothesis $h$ of error at most $\epsilon$ over $D$, $h$ must
have error at most $2\epsilon$ over $D_1$. If player 2 receives fewer than $\log_2(N_{2\epsilon}(\mathcal{H})) - 1$ bits from player 1, then
(considering also the two possible labels of $x$) there are fewer than $N_{2\epsilon}(\mathcal{H})$ possible hypotheses player 2 can
output. Thus, there must be some $f \in \mathcal{H}$ that has distance greater than $2\epsilon$ from all such hypotheses with
respect to $D_1$, and so player 2 cannot learn that function. The $\Omega(d)$ lower bound follows from applying the
above argument to the uniform distribution over $d$ points shattered by $\mathcal{H}$.

For the $\Omega(\log |\mathcal{H}|)$ lower bound, again let $D_2$ be concentrated on a single (arbitrary) point. If player 2
receives fewer than $\frac{1}{2} \log |\mathcal{H}|$ bits then there must be some $h^* \in \mathcal{H}$ it cannot output. Consider $f = h^*$ and
let $D_1$ be uniform over $d_T(\mathcal{H})$ points uniquely defining $f$ within $\mathcal{H}$. Since player 2 is a proper learner, it
must therefore have error greater than $2\epsilon$ over $D_1$, implying error greater than $\epsilon$ over $D$. □

Note that there is a significant gap between the above upper and lower bounds. For instance, if data lies
in $\{0,1\}^d$, then in terms of $d$ the upper bound in bits is $O(d^2)$ but the lower bound is $\Omega(d)$ (or in examples,
the upper bound is $O(d)$ but the lower bound is $\Omega(1)$). In the following sections, we describe our algorithmic
results for improving upon the above baseline methods, as well as stronger communication lower bounds for
certain classes. We also show how boosting can be used to generically get only a logarithmic dependence
of communication on $1/\epsilon$ for any class, using a logarithmic number of rounds.

² $d_T(\mathcal{H})$ is defined as $\max_{f \in \mathcal{H}} d_T(f)$ where $d_T(f)$ is the smallest number of examples needed to uniquely identify $f$ within $\mathcal{H}$
[Goldman and Kearns, 1991].

4 Intersection-closed classes and version-space algorithms 

One simple case where one can perform substantially better than the baseline methods is that of
intersection-closed (or union-closed) classes $\mathcal{H}$, where the functions in $\mathcal{H}$ can themselves be compactly described. For
example, the class of conjunctions and the class of intervals on the real line are both intersection-closed. For 
such classes we have the following. 

Theorem 3. If $\mathcal{H}$ is intersection-closed, then $\mathcal{H}$ can be learned using one round and $k$ hypotheses of total
communication.

Proof. Each entity $i$ draws a sample of size $O(\frac{1}{\epsilon}(d \log(\frac{1}{\epsilon}) + \log(k/\delta)))$ and computes the smallest hypothesis
$h_i \in \mathcal{H}$ consistent with its sample, sending $h_i$ to the center. The center then computes the smallest hypothesis
$h$ such that $h \supseteq h_i$ for all $i$. With probability at least $1 - \delta$, $h$ has error at most $\epsilon$ on each $D_i$ and therefore
error at most $\epsilon$ on $D$ overall. □

Example (conjunctions over $\{0,1\}^d$): In this case, the above procedure corresponds to each player sending
the bitwise-and of all its positive examples to the center. The center then computes the bitwise-and of the
results. The total communication in bits is $O(dk)$. Notice this may be substantially smaller than the $O(d^2)$
bits used by the baseline methods.
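
The conjunction example can be sketched in a few lines of Python. This is an illustrative sketch, assuming monotone conjunctions and examples encoded as $d$-bit integers (bit $j$ set means variable $j$ is 1); the function names are hypothetical.

```python
from functools import reduce
from typing import List


def local_summary(positives: List[int], d: int) -> int:
    """Bitwise-AND of one player's positive examples (all ones if it has none)."""
    return reduce(lambda a, b: a & b, positives, (1 << d) - 1)


def center_combine(summaries: List[int], d: int) -> int:
    """The center ANDs the k received d-bit summaries into the final conjunction."""
    return reduce(lambda a, b: a & b, summaries, (1 << d) - 1)


def predict(mask: int, x: int) -> bool:
    """x is labeled positive iff every variable required by the mask is set."""
    return (x & mask) == mask
```

Each player sends exactly $d$ bits, which gives the $O(dk)$ total stated above.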

Example (boxes in $d$ dimensions): In this case, each player can send its smallest consistent hypothesis using
$2d$ values. The center examines the minimum and maximum in each coordinate to compute the minimal
$h \supseteq h_i$ for all $i$. Total communication is $O(dk)$ values.
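
The box example admits an equally short sketch. The code below is illustrative only; it assumes each player holds at least the positive examples of its own sample, and the names `local_box`, `merge_boxes`, and `predict` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def local_box(positives: Sequence[Sequence[float]]) -> Optional[Box]:
    """Smallest axis-parallel box containing a player's positive examples."""
    if not positives:
        return None  # this player has nothing to report
    lows = [min(col) for col in zip(*positives)]
    highs = [max(col) for col in zip(*positives)]
    return lows, highs


def merge_boxes(boxes: Sequence[Optional[Box]]) -> Optional[Box]:
    """Center: smallest box containing every box received (2d values each)."""
    present = [b for b in boxes if b is not None]
    if not present:
        return None
    lows = [min(col) for col in zip(*(b[0] for b in present))]
    highs = [max(col) for col in zip(*(b[1] for b in present))]
    return lows, highs


def predict(box: Optional[Box], x: Sequence[float]) -> bool:
    """Positive iff x lies inside the merged box."""
    if box is None:
        return False
    lows, highs = box
    return all(lo <= v <= hi for lo, v, hi in zip(lows, x, highs))
```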

In Appendix B.2 we discuss related algorithms based on version spaces. 

5 Reliable-useful learning, parity, and lower bounds 

A classic lower bound in communication complexity states that if two entities each have a set of linear
equalities over $n$ variables, then $\Omega(n^2)$ bits of communication are needed to determine a feasible solution,
based on JaJa and Prasanna [1984]. This in turn implies that for proper learning of parity functions, $\Omega(n^2)$
bits of communication are required even in the case $k = 2$, matching the baseline upper bound given via
Equivalence Query algorithms.

Interestingly, however, if one drops the requirement that learning be proper, then for k = 2, parity 
functions can be learned using only $O(n)$ bits of communication. Moreover, the algorithm is efficient. This
is in fact a special case of the following result for classes that are learnable in the reliable-useful learning 
model of Rivest and Sloan [1988]. 

Definition 1. [Rivest and Sloan, 1988] An algorithm reliably and usefully learns a class $\mathcal{H}$ if given
$\mathrm{poly}(n, 1/\epsilon, 1/\delta)$ time and samples, it produces a hypothesis $h$ that on any given example outputs either a
correct prediction or the statement "I don't know"; moreover, with probability at least $1 - \delta$ the probability
mass of examples for which it answers "I don't know" is at most $\epsilon$.

Theorem 4. Suppose $\mathcal{H}$ is properly PAC-learnable and is learnable (not necessarily properly) in the
reliable-useful model. Then for $k = 2$, $\mathcal{H}$ can be learned in one round with 2 hypotheses of total
communication (or $2b$ bits of communication if each $h \in \mathcal{H}$ can be described in $b = O(\log |\mathcal{H}|)$ bits).

Proof. The algorithm is as follows. First, each player $i$ properly PAC-learns $f$ under $D_i$ to error $\epsilon$, creating
hypothesis $h_i \in \mathcal{H}$. It also learns $f$ reliably-usefully to create hypothesis $g_i$ having don't-know probability
mass at most $\epsilon$ under $D_i$. Next, each player $i$ sends $h_i$ to the other player (but not $g_i$, because $g_i$ may take
too many bits to communicate since it is not guaranteed to belong to $\mathcal{H}$). Finally, each player $i$ produces the
overall hypothesis "If my own $g_i$ makes a prediction, then use it; else use the hypothesis $h_{3-i}$ that I received
from the other player". Note that each player $i$'s final hypothesis has error at most $\epsilon$ under both $D_i$ (because
of $g_i$) and $D_{3-i}$ (because $h_{3-i}$ has error at most $\epsilon$ under $D_{3-i}$ and $g_i$ never makes a mistake) and therefore
has error at most $\epsilon$ under $D$. □

Example (parity functions): Parity functions are properly PAC learnable (by an arbitrary consistent 
solution to the linear equations defined by the sample). They are also learnable in the reliable-useful model 
by a (non-proper) algorithm that behaves as follows: if the given test example x lies in the span of the training 
data, then write x as a sum of training examples and predict the corresponding sum of labels. Else output "I 
don't know". Therefore, for $k = 2$, parity functions are learnable with only $O(n)$ bits of communication.
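
To make the parity example concrete, the following Python sketch implements both ingredients of Theorem 4 for parities over GF(2): a reliable-useful predictor that answers only on the span of its training data, and a proper hypothesis obtained from the same echelon basis. It is a sketch under stated assumptions: examples are encoded as $n$-bit integers, labels are realizable by some parity, and the class and function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def parity(v: int) -> int:
    """Parity (0 or 1) of the number of set bits in v."""
    return bin(v).count("1") % 2


class ReliableUsefulParity:
    """XOR basis of labeled examples, keyed by the highest set bit."""

    def __init__(self) -> None:
        self.basis: Dict[int, Tuple[int, int]] = {}

    def _reduce(self, x: int, y: int) -> Tuple[int, int]:
        # XOR out basis vectors; the label is carried along with the vector.
        while x:
            top = x.bit_length() - 1
            if top not in self.basis:
                break
            bx, by = self.basis[top]
            x, y = x ^ bx, y ^ by
        return x, y

    def add(self, x: int, y: int) -> None:
        x, y = self._reduce(x, y)
        if x:
            self.basis[x.bit_length() - 1] = (x, y)
        elif y:
            raise ValueError("labels are not consistent with any parity")

    def predict(self, x: int) -> Optional[int]:
        """Label of x if x is in the span of the training data, else None."""
        rest, y = self._reduce(x, 0)
        return y if rest == 0 else None  # None plays the role of "I don't know"

    def proper_hypothesis(self) -> int:
        """Some parity (as a bit mask) consistent with all training examples."""
        s = 0
        for top in sorted(self.basis):  # lower pivots are fixed first
            v, y = self.basis[top]
            if parity(v & s) != y:
                s |= 1 << top  # flip the pivot bit to satisfy this equation
        return s


def combined_predict(own: ReliableUsefulParity, other_h: int, x: int) -> int:
    """Theorem 4 rule: use own g_i when it answers, else the received h_{3-i}."""
    g = own.predict(x)
    return g if g is not None else parity(other_h & x)
```

Only the $n$-bit mask returned by `proper_hypothesis` crosses the channel in each direction; the basis itself, which may take on the order of $n^2$ bits, stays local.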

Interestingly, the above result does not apply to the case in which there is a center that must also learn
a good hypothesis. The reason is that the output of the reliable-useful learning procedure might have large
bit-complexity, for example, in the case of parity it has a complexity of $\Omega(n^2)$. A similar problem arises
when there are more than two entities.³

However, we can extend the result to the case of a center if the overall distribution $D$ over unlabeled
data is known to the players. In particular, after running the above protocol to error $\epsilon/d$, each player can
then draw $O(d/\epsilon)$ fresh unlabeled points from $D$, label them using its learned hypothesis, and then perform
proper learning over this data to produce a new hypothesis $h' \in \mathcal{H}$ to send to the center.

6 Decision Lists 

We now consider the class $\mathcal{H}$ of decision lists over $d$ attributes. The best mistake-bound known for this class
is $O(d^2)$, and its VC-dimension is $O(d)$. Therefore, the baseline algorithms give a total communication
complexity, in bits, of $\tilde{O}(d^2/\epsilon)$ for batch learning and $O(d^3)$ for the mistake-bound algorithm.⁴ Here, we
present an improved algorithm, requiring a total communication complexity of only $O(dk \log d)$ bits. This
is a substantial savings over both baseline algorithms, especially when $k$ is small. Note that for constant $k$
and for $\epsilon = o(1/d)$, this bound matches the proper-learning $\Omega(d \log d)$ lower bound of Theorem 2.

Theorem 5. The class of decision lists can be efficiently learned with a total of at most $O(dk \log d)$ bits of
communication and a number of rounds bounded by the number of alternations in the target decision list $f$.

Proof. The algorithm operates as follows. 

1. First, each player $i$ draws a sample $S_i$ of size $O(\frac{1}{\epsilon}(d \log(\frac{1}{\epsilon}) + \log(k/\delta)))$, which is large enough that
consistency with $S_i$ suffices for achieving low error over $D_i$.

2. Next, each player $i$ computes the set $T_i$ of all triplets $(j, b_j, c_j)$ such that the rule "if $x_j = b_j$ then $c_j$"
is consistent with all examples in $S_i$. (For convenience, use $j = 0$ to denote the rule "else $c_j$".) Each
player $i$ then sends its set $T_i$ to the center.

3. The center now computes the intersection of all sets $T_i$ received and broadcasts the result $T = \cap_i T_i$
to all players, i.e., the collection of triplets consistent with every $S_i$.

4. Each player $i$ removes from $S_i$ all examples satisfied by $T$ (that is, all examples on which some rule in $T$ fires).

5. Finally, we repeat steps 2, 3, and 4, but in Step 2 each player sends to the center only the new rules that
have become consistent since the previous rounds (the center adds them into $T_i$; note that there
is never a need to delete any rule from $T_i$); similarly, in Step 3 the center sends only the new rules that
have entered the intersection $T$. The process ends once an "else $c_j$" rule has entered $T$. The final
hypothesis is the decision list consisting of the rules broadcast by the center, in the order they were
broadcast.

³ It is interesting to note that if we allow communication in the classification phase (and not only during learning) then the center
can simply send each test example to all entities, and any entity that classifies it has to be correct.

⁴ One simple observation is that the communication complexity of the mistake-bound algorithm can be reduced to $O(d^2 \log d)$ by
having each player, in the event of a mistake, send only the identity of the offending rule rather than the entire example; this requires
only $O(\log d)$ bits per mistake. However, we will be able to beat this bound substantially.

To analyze the above procedure, note first that since each player announces any given triplet at most once,
and any triplet can be described using $O(\log d)$ bits, the total communication in bits per player is at most
$O(d \log d)$, for a total of $O(dk \log d)$ overall. Next, note that the topmost rule in $f$ will be consistent with
each $S_i$, and indeed so will all rules appearing before the first alternation in $f$. Therefore, these will be
present in each $T_i$ and thus contained in $T$. Thus, each player will remove all examples exiting through any
such rule. By induction, after $r$ rounds of the protocol, all players will have removed all examples in their
datasets that exit in one of the top $r$ alternations of $f$, and therefore in the next round all rules in the $(r+1)$st
alternation of $f$ that have not been broadcast already will be output by the center. This implies the number
of rounds will be bounded by the number of alternations of $f$. Finally, note that the hypothesis produced
will by design be consistent with each $S_i$ since a new rule is added to $T$ only when it is consistent with every
$S_i$. □
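
The protocol in the proof of Theorem 5 can be simulated directly. The Python sketch below is illustrative rather than definitive: it runs all players and the center in one process, assumes the samples are realizable by some decision list over $\{0,1\}^d$, counts only the triplets sent from players to the center, and uses hypothetical names throughout. A rule is a triplet `(j, b, c)` with `j = 0` standing for "else c".

```python
from typing import List, Set, Tuple

Rule = Tuple[int, int, int]                 # (j, b, c); j == 0 means "else c"
Sample = List[Tuple[Tuple[int, ...], int]]  # (bit vector, label)


def fires(rule: Rule, x: Tuple[int, ...]) -> bool:
    j, b, _ = rule
    return j == 0 or x[j - 1] == b


def consistent_rules(sample: Sample, d: int) -> Set[Rule]:
    """All triplets whose rule is consistent with every example in the sample."""
    rules = set()
    for c in (0, 1):
        if all(y == c for _, y in sample):
            rules.add((0, 0, c))
        for j in range(1, d + 1):
            for b in (0, 1):
                if all(y == c for x, y in sample if x[j - 1] == b):
                    rules.add((j, b, c))
    return rules


def learn_decision_list(samples: List[Sample], d: int):
    """Returns (decision list, rounds, triplets sent by players)."""
    samples = [list(s) for s in samples]
    announced: List[Set[Rule]] = [set() for _ in samples]  # T_i held at the center
    T: Set[Rule] = set()
    decision_list: List[Rule] = []
    rounds = triplets_sent = 0
    while True:
        rounds += 1
        for i, s in enumerate(samples):      # Step 2: send only new rules
            new = consistent_rules(s, d) - announced[i]
            announced[i] |= new
            triplets_sent += len(new)
        fresh = set.intersection(*announced) - T   # Step 3: new common rules
        if not fresh:
            raise ValueError("no new common rule; data may not be realizable")
        T |= fresh
        # Within one broadcast, place any "else" rule last.
        decision_list.extend(sorted(fresh, key=lambda r: (r[0] == 0, r)))
        if any(r[0] == 0 for r in fresh):
            return decision_list, rounds, triplets_sent
        samples = [[(x, y) for x, y in s if not any(fires(r, x) for r in fresh)]
                   for s in samples]         # Step 4: drop covered examples


def predict(decision_list: List[Rule], x: Tuple[int, ...]) -> int:
    for rule in decision_list:
        if fires(rule, x):
            return rule[2]
    raise ValueError("decision list has no default rule")
```

There are only $4d + 2$ distinct triplets, so each player sends at most that many over the whole run, which is where the $O(d \log d)$ bits per player come from.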



7 Linear Separators 

We now consider the case of learning homogeneous linear separators in $\mathbb{R}^d$. For this problem, we will
for convenience discuss communication in terms of the number of vectors transmitted, rather than bits.
However, for data of margin $\gamma$, all vectors transmitted can be given using $O(d \log 1/\gamma)$ bits each.

One simple case is when $D$ is a radially symmetric distribution such as the symmetric Gaussian
distribution centered at the origin, or the uniform distribution on the sphere. In that case, it is known that
$E_{x \sim D}[\ell(x) x / \|x\|]$ is a vector exactly in the direction of the target vector, where $\ell(x)$ is the label of $x$.
Moreover, an average over $O(d/\epsilon^2)$ samples is sufficient to produce an estimate of error at most $\epsilon$ with high
probability [Servedio, 2002]. Thus, so long as each player draws a sufficiently large sample $S_i$, we can learn
to any desired error $\epsilon$ with a total communication of only $k$ examples: each database simply computes an
average over its own data and sends it to the center, which combines the results.
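
The averaging scheme for radially symmetric distributions can be sketched as follows. This is a minimal illustration, assuming every player draws from the same spherical Gaussian and the target is the first coordinate axis; `make_sample` is a hypothetical data source used only to exercise the code, and the other names are likewise hypothetical.

```python
import math
import random
from typing import List, Tuple

Sample = List[Tuple[List[float], int]]  # (point, label in {-1, +1})


def make_sample(n: int, d: int, rng: random.Random) -> Sample:
    """Hypothetical data: spherical Gaussian points labeled by sign(x[0])."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [(x, 1 if x[0] >= 0 else -1) for x in pts]


def local_average(sample: Sample) -> List[float]:
    """One player's message: the mean of label * x / ||x|| over its sample."""
    d = len(sample[0][0])
    acc = [0.0] * d
    for x, y in sample:
        norm = math.sqrt(sum(v * v for v in x))
        for j in range(d):
            acc[j] += y * x[j] / norm
    return [a / len(sample) for a in acc]


def center_combine(averages: List[List[float]]) -> List[float]:
    """The center averages the k received vectors with equal weights."""
    k = len(averages)
    return [sum(col) / k for col in zip(*averages)]


def cosine_to_axis0(w: List[float]) -> float:
    """Cosine of the angle between w and the assumed target e_1."""
    return w[0] / math.sqrt(sum(v * v for v in w))
```

Each player transmits a single $d$-dimensional vector, matching the total of $k$ vectors described above.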

The above result, however, requires very precise conditions on the overall distribution. In the following
we consider several more general scenarios: learning a large-margin separator when data is "well-spread",
learning over non-concentrated distributions, and learning linear separators without any additional
assumptions.

7.1 Learning large-margin separators when data is well-spread

We say that data is $\alpha$-well-spread if for all datapoints $x_i$ and $x_j$ we have
$\frac{|x_i \cdot x_j|}{\|x_i\| \|x_j\|} < \alpha$. In the following we
show that if data is indeed $\alpha$-well-spread for a small value of $\alpha$, then the Perceptron algorithm can be used
to learn with substantially less communication than that given by just using its mistake-bound directly as in
Theorem 1.

Theorem 6. Suppose that data is $\alpha$-well-spread and furthermore that all points have margin at least $\gamma$ with
the target $w^*$. Then we can find a consistent hypothesis with a version of the Perceptron algorithm using at
most $O(k(1 + \alpha/\gamma^2))$ rounds of communication, each round communicating a single hypothesis.

Proof. We will run the algorithm in meta-rounds. Each meta-round will involve a round-robin communication
between the players $1, \ldots, k$. Starting from initial hypothesis $w_0 = 0$, each player $i$ will in turn run the
Perceptron algorithm on its data until it finds a consistent hypothesis $w_{t,i}$ that moreover satisfies $|w_{t,i} \cdot x_i| > 1$
for all of its examples $x_i$. It then sends the hypothesis $w_{t,i}$ produced to player $i + 1$ along with the number of
updates it performed, who continues this algorithm on its own data, starting from the most recent hypothesis
$w_{t,i}$. When player $k$ sends $w_{t,k}$ to player 1, we start meta-round $t + 1$. At the start of meta-round $t + 1$,
player 1 counts the number of updates made in the previous meta-round, and if it is less than $1/\alpha$ we stop
and output the current hypothesis.

It is known that this "Margin Perceptron" algorithm makes at most $3/\gamma^2$ updates in total.⁵ Note that
if in a meta-round all the players make fewer than $1/\alpha$ updates in total, then we know the hypothesis will
still be consistent with all players' data. That is because each update can decrease the inner product of the
hypothesis with some $x_i$ of another player by at most $\alpha$. So, if fewer than $1/\alpha$ updates occur, it implies that
every player's examples are still classified correctly. This implies that the total number of communication
meta-rounds until a consistent hypothesis is produced will be at most $1 + 3\alpha/\gamma^2$. In particular, this follows
because the total number of updates is at most $3/\gamma^2$, and each meta-round, except the last, makes at least $1/\alpha$
updates. □
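
The round-robin Margin Perceptron of Theorem 6 can be simulated in a single process as below. This is a sketch under stated assumptions: all points are unit vectors, labels are in $\{-1, +1\}$, `alpha` is taken to be the largest absolute inner product between two distinct points, and `make_sample` is a hypothetical generator of data with a guaranteed margin relative to the target $e_1$.

```python
import math
import random
from typing import List, Tuple

Sample = List[Tuple[List[float], int]]  # (unit vector, label in {-1, +1})


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def make_sample(n: int, d: int, margin: float, rng: random.Random) -> Sample:
    """Hypothetical data: unit vectors with |x[0]| >= margin, labeled by sign(x[0])."""
    out: Sample = []
    while len(out) < n:
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(x, x))
        x = [v / norm for v in x]
        if abs(x[0]) >= margin:
            out.append((x, 1 if x[0] > 0 else -1))
    return out


def spread(players: List[Sample]) -> float:
    """Largest |x_i . x_j| over distinct points, used here as alpha."""
    pts = [x for s in players for x, _ in s]
    return max(abs(dot(pts[i], pts[j]))
               for i in range(len(pts)) for j in range(i))


def local_margin_perceptron(w: List[float], data: Sample):
    """Update until label * (w . x) >= 1 on every local example."""
    updates = 0
    changed = True
    while changed:
        changed = False
        for x, y in data:
            if y * dot(w, x) < 1:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
    return w, updates


def round_robin(players: List[Sample], alpha: float):
    """Returns (hypothesis, meta-rounds, total updates)."""
    w = [0.0] * len(players[0][0][0])
    meta_rounds = total_updates = 0
    while True:
        meta_rounds += 1
        in_round = 0
        for data in players:           # each hand-off is one hypothesis sent
            w, u = local_margin_perceptron(w, data)
            in_round += u
        total_updates += in_round
        if in_round < 1 / alpha:       # stopping rule from the proof
            return w, meta_rounds, total_updates
```

The stopping rule is what saves communication: a vector is sent once per player per meta-round rather than once per update.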

7.2 Learning linear separators over non-concentrated distributions 

We now use the analysis of Section 7.1 to achieve good communication bounds for learning linear separators
over non-concentrated distributions. Specifically, we say a distribution over the d-dimensional unit sphere 
is non-concentrated if for some constant c, the probability density on any point x is at most c times greater 
than that of the uniform distribution over the sphere. The key idea is that in a non-concentrated distribution, 
nearly all pairs of points will be close to orthogonal, and most points will have reasonable margin with 
respect to the target. 

Theorem 7. For any non-concentrated distribution $D$ over $\mathbb{R}^d$ we can learn to error $O(\epsilon)$ using only
$O(k^2 \sqrt{d \log(dk/\epsilon)} / \epsilon^2)$ rounds of communication, each round communicating a single hypothesis vector.

Proof. Note that for any non-concentrated distribution $D$, the probability that two random examples $x, x'$
from $D$ satisfy $|x \cdot x'| > t/\sqrt{d}$ is $e^{-O(t^2)}$. This implies that in a polynomial-size sample (polynomial in $d$
and $1/\epsilon$), with high probability, any two examples $x_i, x_j$ in the sample satisfy
$|x_i \cdot x_j| < c'\sqrt{\log(d/\epsilon)/d}$ for some constant $c'$. Additionally, for any such distribution $D$ there exists
another constant $c''$ such that for any $\epsilon > 0$, there is at most $\epsilon$ probability mass of $D$ that lies within
margin $\gamma_\epsilon = c''\epsilon/\sqrt{d}$ of the target.

$^5$ Because after update $\tau$ we get $\|w_{\tau+1}\|^2 \leq \|w_\tau\|^2 + 2\ell(x_i)(w_\tau \cdot x_i) + 1 \leq \|w_\tau\|^2 + 3$.
These together imply that using the proof idea of Theorem 6, we can learn to error $O(\epsilon)$ using only
$O(k^2 \sqrt{d \log(dk/\epsilon)}/\epsilon^2)$ communication rounds. Specifically, each player acts as follows. If the hypothesis
$w$ given to it has error at most $\epsilon$ on its own data, then it makes no updates and just passes $w$ along. Otherwise,
it makes updates using the margin-perceptron algorithm by choosing random examples $x$ from its own
distribution $D_i$ satisfying $\ell(x)(w \cdot x) < 1$ until the fraction of examples $x$ under $D_i$ for which
$\ell(x)(w \cdot x) < 1$ is at most $\epsilon$, sending the final hypothesis produced to the next player. Since before each
update, the probability mass under $D_i$ of $\{x : \ell(x)(w \cdot x) < 1\}$ is at least $\epsilon$, the probability mass of this
region under $D$ is at least $\epsilon/(2k)$. This in turn means there is at least a 1/2 probability that the example used
for updating has margin at least $\gamma_{\epsilon/(2k)} = \Omega(\epsilon/(k\sqrt{d}))$ with respect to the target. Thus, the total
number of updates made over the entire algorithm will be only $O(dk^2/\epsilon^2)$. Since the process will halt if all
players make fewer than $1/\alpha$ updates in a meta-round, for $\alpha = c'\sqrt{\log(2dk/\epsilon)/d}$, this implies the total
number of communication meta-rounds is $O(k^2 \sqrt{d \log(dk/\epsilon)}/\epsilon^2)$. □
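The final step of this proof can be written out as a short calculation. The derivation below is a sketch that takes the update bound $U = O(dk^2/\epsilon^2)$ from the proof and assumes $\alpha = c'\sqrt{\log(2dk/\epsilon)/d}$, the value consistent with the bound stated in Theorem 7; $R$ (the number of meta-rounds) is notation introduced here.

```latex
% Every meta-round except the last makes at least 1/\alpha updates, so
(R - 1)\cdot\frac{1}{\alpha} \;\le\; U
\quad\Longrightarrow\quad
R \;\le\; 1 + \alpha U .
% Substituting U = O(dk^2/\epsilon^2) and \alpha = c'\sqrt{\log(2dk/\epsilon)/d}:
R \;=\; O\!\left(\frac{dk^2}{\epsilon^2}\cdot\sqrt{\frac{\log(dk/\epsilon)}{d}}\right)
  \;=\; O\!\left(\frac{k^2\sqrt{d\,\log(dk/\epsilon)}}{\epsilon^2}\right).
```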

Note that in Section 8 we show how boosting can be implemented communication-efficiently so that
any class learnable to constant error rate from a sample of size $O(d)$ can be learned to error $\epsilon$ with total
communication of only $O(d \log 1/\epsilon)$ examples (plus a small number of additional bits). However, as usual
with boosting, this requires a distribution-independent weak learner. The "$1/\epsilon^2$" term in the bound of
Theorem 7 comes from the margin that is satisfied by a $1 - \epsilon$ fraction of points under a non-concentrated
distribution, and so the results of Section 8 do not eliminate it.

7.3 Learning linear separators without any additional assumptions 

If we are willing to have a bound that depends on the dimension $d$, then we can run a mistake-bound
algorithm for learning linear separators, using Theorem 1. Specifically, we can use a mistake-bound algorithm
based on reducing the volume of the version space of consistent hypotheses (which is a polyhedron). The
initial volume is 1 and the final volume is $\gamma^d$, where $\gamma$ is the margin of the sample. In every round, each
player checks if it has an example that reduces the volume by half (volume of hypotheses consistent with
all examples broadcast so far). If it does, it sends it (we are using here the lock-synchronization model).
If no player has such an example, then we are done. The hypothesis we have is for each $x$ to predict with
the majority of the consistent hypotheses. This gives a total of $O(d \log 1/\gamma)$ examples communicated. In
terms of bits, each example has $d$ dimensions, and we can encode each dimension with $O(\log 1/\gamma)$ bits, thus
the total number of bits communicated is $O(d^2 \log^2 1/\gamma)$. Alternatively, we can replace the $\log 1/\gamma$ term
with a $\log 1/\epsilon$ term by using a PAC-learning algorithm to learn to constant error rate, and then applying the
boosting results of Theorem 10 in Section 8 below.

It is natural to ask whether running the Perceptron algorithm in a round-robin fashion could be used to
improve the generic $O(1/\gamma^2)$ communication bound given by the baseline results of Theorem 1, for general
distributions of margin $\gamma$. However, in Appendix C we present an example where the Perceptron algorithm
indeed requires $\Omega(1/\gamma^2)$ rounds.

Theorem 8. There are inputs for $k = 2$ with margin $\gamma$ such that the Perceptron algorithm takes $\Omega(1/\gamma^2)$
rounds.

8 Boosting for Logarithmic Dependence on $1/\epsilon$

We now consider the general question of dependence of communication on $1/\epsilon$, showing how boosting can
be used to achieve $O(\log 1/\epsilon)$ total communication in $O(\log 1/\epsilon)$ rounds for any concept class, and more
generally a tradeoff between communication and rounds.

Boosting algorithms provide a mechanism to produce an $\epsilon$-error hypothesis given access only to a weak
learning oracle, which on any distribution finds a hypothesis of error at most some value $\beta < 1/2$ (i.e.,
a bias $\gamma = 1/2 - \beta > 0$). Most boosting algorithms are weight-based, meaning they assign weights to
each example $x$ based solely on the performance of the hypotheses generated so far on $x$, with probabilities
proportional to weights.$^6$ We show here that any weight-based boosting algorithm can be applied to achieve
strong learning of any class with low overall communication. The key idea is that in each round, players
need only send enough data to the center for it to produce a weak hypothesis. Once the weak hypothesis
is constructed and broadcast to all the players, the players can use it to separately re-weight their own
distributions and send data for the next round. No matter how large or small the weights become, each
round only needs a small amount of data to be transmitted. Formally, we show the following:

Lemma 9. Given any weight-based boosting algorithm that achieves error $\epsilon$ by making $r(\epsilon, \beta)$ calls to a
$\beta$-weak learning oracle for $\mathcal{H}$, we can construct a distributed learning algorithm achieving error $\epsilon$ that uses
$O(r(\epsilon, \beta))$ rounds, each involving $O((d/\beta) \log(1/\beta))$ examples and an additional $O(k \log(d/\beta))$ bits of
communication per round.

Proof. The key property of weight-based boosting algorithms that we will use is that they maintain a current 
distribution such that the probability mass on any example x is solely a function of the performance of the 
weak-hypotheses seen so far on x, except for a normalization term that can be communicated efficiently. 
This will allow us to perform boosting in a distributed fashion. Specifically, we run the boosting algorithm 
in rounds, as follows. 

Initialization: Each player $i$ will have a weight $w_{i,t}$ for round $t$. We begin with $w_{i,0} = 1$ for all $i$. Let
$W_t = \sum_{i=1}^{k} w_{i,t}$, so initially $W_0 = k$. These weights will all be known to the center. Each player $i$
will also have a large weighted sample $S_i$, drawn from $D_i$, known only to itself. $S_i$ will be weighted
according to the specific boosting algorithm (and for all standard boosting algorithms, the points in $S_i$
begin with equal weights). We now repeat the following four steps for $t = 1, 2, 3, \ldots$

1. Pre-sampling: The center determines the number of samples $n_{i,t}$ to request from each player $i$ by
sampling $O((d/\beta) \log(1/\beta))$ times from the multinomial distribution $w_{i,t-1}/W_{t-1}$. It then sends each
player $i$ the number $n_{i,t}$, which requires only $O(\log(d/\beta))$ bits.

2. Sampling: Each player $i$ samples $n_{i,t}$ examples from its local sample $S_i$ in proportion to its own internal
example weights, and sends them to the center.

3. Weak-learning: The center takes the union of the received examples and uses these $O((d/\beta) \log(1/\beta))$
samples to produce a weak hypothesis $h_t$ of error at most $\beta/2$ over the current weighted distribution, which it
then sends to the players.$^7$

4. Updating: Each player $i$, given $h_t$, computes the new weight of each example in $S_i$ using the underlying
boosting algorithm and sends their sum $w_{i,t}$ to the center. This sum can be sent to sufficient accuracy
using $O(\log(1/\beta))$ bits.

$^6$ E.g., Schapire [1990], Freund [1990], Freund and Schapire [1997]. (For Adaboost, we are considering the version that uses a
fixed upper bound $\beta$ on the error of the weak hypotheses.) Normalization may of course be based on overall performance.

$^7$ In fact, because we have a broadcast model, technically the players each can observe all examples sent in step (2) and so can
simulate the center in this step.
In each round, steps (1) and (2) ensure that the center receives $O((d/\beta) \log(1/\beta))$ examples distributed
according to a distribution $D'$ matching that given by the boosting algorithm, except for small rounding
error due to the number of bits sent in step (4). Specifically, the variation distance between $D'$ and the
distribution given by the boosting algorithm is at most $\beta/2$. Therefore, in step (3), it computes a hypothesis
$h_t$ with error at most $\beta/2 + \beta/2 = \beta$ with respect to the current distribution given by the boosting algorithm.
In step (4), the examples in all sets $S_i$ then have their weights updated as determined by the boosting
algorithm, and the values $w_{i,t}$ transmitted ensure that the normalizations are correct. Therefore, we are
simulating the underlying boosting algorithm having access to a $\beta$-weak learner, and so the number of rounds
is $r(\epsilon, \beta)$. The overall communication per round is $O((d/\beta) \log(1/\beta))$ examples plus $O(k \log(d/\beta))$ bits
for communicating the numbers $n_{i,t}$ and $w_{i,t}$, as desired. □
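The four steps of the protocol can be sketched in Python as below. This is a simplified illustration under several assumptions that are not in the text: the hypothesis class is one-dimensional threshold stumps, the booster is an AdaBoost-style reweighting with a fixed vote weight derived from $\beta$ (so the final hypothesis is a plain majority vote), and the names `best_stump`, `predict`, and `distributed_boost` are hypothetical. The point is only the data flow: the center sees per-player weight sums and a small sample, never the full local data.

```python
import math
import random


def best_stump(sample):
    """Center's weak learner: empirical-risk-minimizing threshold stump."""
    xs = sorted({x for x, _ in sample})
    cuts = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for theta in cuts:
        for sign in (1, -1):
            err = sum(1 for x, y in sample if (sign if x >= theta else -sign) != y)
            if best is None or err < best[0]:
                best = (err, theta, sign)
    return best[1], best[2]


def predict(stump, x):
    theta, sign = stump
    return sign if x >= theta else -sign


def distributed_boost(players, rounds, m, beta, rng):
    # Example weights stay local; each player starts with total weight 1.
    weights = [[1.0 / len(S)] * len(S) for S in players]
    step = 0.5 * math.log((1 - beta) / beta)   # fixed vote weight from the bound beta
    stumps = []
    for _ in range(rounds):
        sums = [sum(w) for w in weights]       # the w_{i,t-1} known to the center
        # Step 1 (pre-sampling): m multinomial draws decide each n_{i,t}.
        owners = rng.choices(range(len(players)), weights=sums, k=m)
        sample = []
        for i, S in enumerate(players):        # Step 2 (sampling)
            n_i = owners.count(i)
            if n_i:
                sample += rng.choices(S, weights=weights[i], k=n_i)
        h = best_stump(sample)                 # Step 3 (weak learning)
        stumps.append(h)
        for i, S in enumerate(players):        # Step 4 (local reweighting)
            weights[i] = [w * math.exp(-step * y * predict(h, x))
                          for w, (x, y) in zip(weights[i], S)]
    return lambda x: 1 if sum(predict(h, x) for h in stumps) >= 0 else -1
```

Per round, the communication is the `m` sampled examples, the `k` counts sent out in step 1, and the `k` weight sums sent back in step 4, independent of how large each local sample is.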

By adjusting the parameter $\beta$, we can trade off between the number of rounds and communication
complexity. In particular, using Adaboost [Freund and Schapire, 1997] in Lemma 9 yields the following
result (plugging in $\beta = 1/4$ or $\beta = \epsilon^{1/c}$ respectively):

Theorem 10. Any class $\mathcal{H}$ can be learned to error $\epsilon$ in $O(\log 1/\epsilon)$ rounds and $O(d)$ examples plus $O(k \log d)$
bits of communication per round. For any $c > 1$, $\mathcal{H}$ can be learned to error $\epsilon$ in $O(c)$ rounds and
$O((d/\epsilon^{1/c}) \log(1/\epsilon))$ examples plus $O(k \log(d/\epsilon))$ bits communicated per round.

Thus, any class of VC-dimension $d$ can be learned using $O(\log 1/\epsilon)$ rounds and a total of $O(d \log 1/\epsilon)$
examples, plus a small number of extra bits of communication.
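The two parameter settings can be checked against Lemma 9 with a short calculation. The sketch below assumes the standard AdaBoost bound on the error over the weighted sample, $\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}$, with every weak-hypothesis error $\epsilon_t \leq \beta$; $T$ denotes the number of boosting rounds, and the condition $\epsilon^{1/c} \leq 1/16$ is introduced here only to make the constant explicit.

```latex
% Error after T rounds when each weak hypothesis has error at most \beta < 1/2:
\mathrm{err}_T \;\le\; \bigl(2\sqrt{\beta(1-\beta)}\bigr)^{T} \;\le\; (2\sqrt{\beta})^{T}.

% Case \beta = 1/4: the base is \sqrt{3}/2 < 1, so
T = O(\log(1/\epsilon)), \qquad
\frac{d}{\beta}\log\frac{1}{\beta} = O(d) \text{ examples per round}.

% Case \beta = \epsilon^{1/c}, taking T = 4c and assuming \epsilon^{1/c} \le 1/16:
(2\sqrt{\beta})^{4c} = 16^{c}\,\epsilon^{2} \;\le\; \epsilon, \qquad
\frac{d}{\beta}\log\frac{1}{\beta}
  = \frac{d}{c\,\epsilon^{1/c}}\log\frac{1}{\epsilon}
  = O\!\left(\frac{d}{\epsilon^{1/c}}\log\frac{1}{\epsilon}\right) \text{ examples per round}.
```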

9 Agnostic Learning 

Balcan and Hanneke [2012] show that any class $\mathcal{H}$ can be agnostically learned to error $O(\mathrm{opt}(\mathcal{H})) + \epsilon$ using
only $O(d \log 1/\epsilon)$ label requests, in an active learning model where class-conditional queries are allowed.
We can use the core of their result to agnostically learn any finite class $\mathcal{H}$ to error $O(\mathrm{opt}(\mathcal{H})) + \epsilon$ in our
setting, with a total communication that depends only (poly)logarithmically on $1/\epsilon$. The key idea is that
we can simulate their robust generalized halving algorithm using communication proportional only to the
number of class-conditional queries their algorithm makes.

Theorem 11. Any finite class $\mathcal{H}$ can be learned to error $O(\mathrm{opt}(\mathcal{H})) + \epsilon$ with a total communication
of $O(k \log(|\mathcal{H}|) \log\log(|\mathcal{H}|) \log(1/\epsilon))$ examples and $O(k \log(|\mathcal{H}|) \log\log(|\mathcal{H}|) \log^2(1/\epsilon))$ additional bits.
The latter may be eliminated if shared randomness is available.

Proof. We prove this result by simulating the robust generalized halving algorithm of Balcan and Hanneke
[2012], for the case of finite hypothesis spaces, in a communication-efficient manner.$^8$ In particular, the
algorithm operates as follows. For this procedure, $N = O(\log\log|\mathcal{H}|)$ and $s = O(1/(\mathrm{opt}(\mathcal{H}) + \epsilon))$ is
such that the probability that the best hypothesis in $\mathcal{H}$ will have some error on a set of $s$ examples is a small
constant.

$^8$ The algorithm of Balcan and Hanneke [2012] for the case of infinite hypothesis spaces begins by using a large unlabeled sample
to determine a small $\epsilon$-cover of $\mathcal{H}$. This appears to be difficult to simulate communication-efficiently.

1. We begin by drawing $N$ sets $S_1, \ldots, S_N$ of size $s$ from $D$. This can be implemented
communication-efficiently as follows. For $j = 1, \ldots, N$, player 1 makes $s$ draws from $\{1, \ldots, k\}$ to determine the
number $n_{ij}$ of points in $S_j$ that should come from each $D_i$. Player 1 then sends each player $i$ the list
$(n_{i1}, n_{i2}, \ldots, n_{iN})$, who draws (but keeps internally and does not send) $n_{ij}$ examples of $S_j$ for each
$1 \leq j \leq N$. Total communication: $O(kN \log(s))$ bits. Note that if shared randomness is available,
then the computation of $n_{ij}$ can be simulated by each player and so in that case no communication is
needed in this step.

2. Next we determine which sets $S_j$ contain an example on which the majority-vote hypothesis over $\mathcal{H}$,
$\mathrm{maj}(\mathcal{H})$, makes a mistake, and identify one such example $(x_j, y_j)$ for each such set. We can
implement this communication-efficiently by having each player $i$ evaluate $\mathrm{maj}(\mathcal{H})$ on its own portion
of each set $S_j$ and broadcast a mistake for each set on which at least one mistake is made. Total
communication: $O(kN)$ examples.

3. If no more than $N/3$ sets $S_j$ contained a mistake for $\mathrm{maj}(\mathcal{H})$ then halt. Else, remove from $\mathcal{H}$ each $h$
that made mistakes on more than $N/9$ of the identified examples $(x_j, y_j)$, and go to (1). This step can
be implemented separately by each player without any communication.

Balcan and Hanneke [2012] show that with high probability the above process halts within $O(\log|\mathcal{H}|)$
rounds, does not remove the optimal $h \in \mathcal{H}$, and furthermore that when it halts, $\mathrm{maj}(\mathcal{H})$ has error
$O(\mathrm{opt}(\mathcal{H})) + \epsilon$. The total amount of communication is therefore $O(k \log(|\mathcal{H}|) \log\log(|\mathcal{H}|))$ examples and
$O(k \log(|\mathcal{H}|) \log\log(|\mathcal{H}|) \log(1/\epsilon))$ additional bits. The above has been assuming that the value of $\mathrm{opt}(\mathcal{H})$
is known; if not then one can perform binary search, multiplying the above quantities by an additional
$O(\log(1/\epsilon))$ term. Thus, we achieve the desired error rate within the desired communication bounds. □
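Two bookkeeping pieces of this simulation are easy to make concrete: the allocation of draws in step 1 and the pruning rule in step 3. The Python sketch below is illustrative only; `allocate_draws` and `prune` are hypothetical names, players are indexed from 0, hypotheses are modeled as plain functions, and the draws over players are taken to be uniform.

```python
import random


def allocate_draws(k, N, s, rng):
    """Step 1 bookkeeping done by player 1: for each set S_j, make s uniform
    draws over the k players. counts[i][j] is the number of points of S_j
    that player i must draw locally (the n_ij of the text)."""
    counts = [[0] * N for _ in range(k)]
    for j in range(N):
        for _ in range(s):
            counts[rng.randrange(k)][j] += 1
    return counts


def prune(hypotheses, identified_mistakes, N):
    """Step 3: drop every h that errs on more than N/9 of the broadcast
    examples (x_j, y_j); each player can do this without communication."""
    return [h for h in hypotheses
            if sum(1 for x, y in identified_mistakes if h(x) != y) <= N / 9]
```

Each row `counts[i]` is the list sent to player `i`; it holds `N` integers of at most `log s` bits each, matching the stated bit count.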

10 Privacy 

In the context of distributed learning, it is also natural to consider the question of privacy. We begin by 
considering the well-studied notion of differential privacy with respect to the examples, showing how this 
can be achieved in many cases without any increase in communication costs. We then consider the case that 
one would like to provide additional privacy guarantees for the players themselves. One option is to view 
each player as a single (large) example, but this requires many players to achieve any nontrivial accuracy 
guarantees. Thus, we also consider a natural notion of distributional privacy, in which players do not view 
their distribution $D_i$ as sensitive, but rather only the sample $S_i$ drawn from it. We analyze how large a sample
is sufficient so that players can achieve accurate learning while not revealing more information about their 
sample than is inherent in the distribution it was drawn from. We now examine each notion in turn, and for 
each we explore how it can be achieved and the effect on communication. 

10.1 Differential privacy with respect to individual examples 

In this setting we imagine that each entity $i$ (e.g., a hospital) is responsible for the privacy of each example
$x \in S_i$ (e.g., its patients). In particular, suppose $\sigma$ denotes a sequence of interactions between entity $i$ and
the other entities or center, and $\alpha > 0$ is a given privacy parameter. Differential privacy asks that for any
$S_i$ and any modification $S_i'$ of $S_i$ in which any one example has been arbitrarily changed, for all $\sigma$ we have
$e^{-\alpha} \leq \Pr_{S_i}(\sigma)/\Pr_{S_i'}(\sigma) \leq e^{\alpha}$, where probabilities are over internal randomization of entity $i$. (See Dwork
[2006, 2008, 2009] for a discussion of motivations and properties of differential privacy and a survey of
results.)

In our case, one natural approach for achieving privacy is to require that all interaction with each entity
$i$ be in the form of statistical queries [Kearns, 1998]. It is known that statistical queries can be implemented
in a privacy-preserving manner [Dwork and Nissim, 2004, Blum et al., 2005, Kasiviswanathan et al., 2008],
and in particular that a sample of size $O(\max[\frac{M}{\alpha\tau}, \frac{1}{\tau^2}] \log(M/\delta))$ is sufficient to preserve privacy while
answering $M$ statistical queries to tolerance $\tau$ with probability $1 - \delta$. For completeness, we present the proof
below.

Theorem 12. [Dwork and Nissim, 2004, Blum et al., 2005, Kasiviswanathan et al., 2008] If $\mathcal{H}$ is learnable
using $M$ statistical queries of tolerance $\tau$, then $\mathcal{H}$ is learnable preserving differential privacy with privacy
parameter $\alpha$ from a sample $S$ of size $O(\max[\frac{M}{\alpha\tau}, \frac{1}{\tau^2}] \log(M/\delta))$.

Proof. For a single statistical query, privacy with parameter $\alpha'$ can be achieved by adding Laplace noise
of width $O(\frac{1}{\alpha'|S|})$ to the empirical answer of the query on $S$. That is because changing a single entry in
$S$ can change the empirical answer by at most $1/|S|$, so by adding such noise we have that for any $v$,
$\Pr_S(v)/\Pr_{S'}(v) \leq e^{\alpha'}$. Note that with probability at least $1 - \delta'$, the amount of noise added to any given
answer is at most $O(\frac{1}{\alpha'|S|} \log(1/\delta'))$. Thus, if the overall algorithm requires $M$ queries to be answered to
tolerance $\tau$, then setting $\alpha' = \alpha/M$, $\delta' = \delta/(2M)$, $\tau = O(\frac{1}{\alpha'|S|} \log(1/\delta'))$, privacy can be achieved so long
as we have $|S| = O(\max[\frac{M}{\alpha\tau}, \frac{1}{\tau^2}] \log(M/\delta))$, where the second term of the max is the sample size needed
to achieve tolerance $\tau$ for $M$ queries even without privacy considerations. As described in Dwork et al.
[2010], one can achieve a somewhat weaker privacy guarantee using $\alpha' = O(\alpha/\sqrt{M})$. □
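The noise-addition step in this proof can be sketched in Python as follows. This is a minimal illustration rather than the construction of the cited papers: the name `private_sq` and its parameters are hypothetical, each query is assumed to map an example to a value in [0, 1] (so one changed example moves the empirical mean by at most $1/|S|$), and Laplace noise is generated as the difference of two independent exponential variates.

```python
import random


def private_sq(sample, query, alpha_q, rng):
    """Answer one statistical query on `sample` with Laplace noise of
    scale 1 / (alpha_q * |S|), where alpha_q is the per-query privacy budget."""
    empirical = sum(query(z) for z in sample) / len(sample)
    scale = 1.0 / (alpha_q * len(sample))
    # The difference of two Exp(rate = 1/scale) draws is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return empirical + noise
```

For a protocol issuing `M` queries under total budget `alpha`, the proof's accounting corresponds to calling this with `alpha_q = alpha / M`.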

However, this generic approach may involve significant communication overhead over the best non- 
private method. Instead, in many cases we can achieve privacy without any communication overhead at all 
by performing statistical queries internally to the entities. For example, in the case of intersection-closed 
classes, we have the following privacy-preserving version of Theorem 3. 

Theorem 13. If $\mathcal{H}$ can be properly learned via statistical queries to $D^+$ only, then $\mathcal{H}$ can be learned using
one round and $k$ hypotheses of total communication while preserving differential privacy.

Proof. Each entity $i$ learns a hypothesis $h_i \in \mathcal{H}$ using privacy-preserving statistical queries to its own $D_i^+$,
and sends $h_i$ to the center. Note that $h_i \subseteq f$ because the statistical query algorithm must succeed for any
possible $D^-$. Therefore, the center can simply compute the minimal $h \in \mathcal{H}$ such that $h \supseteq h_i$ for all $i$, which
will have error at most $\epsilon$ over each $D_i$ and therefore error at most $\epsilon$ over $D$. □

For instance, the class of conjunctions can be learned via statistical queries to $D^+$ only by producing the
conjunction of all variables $x_j$ such that $\Pr_{D^+}[x_j = 0] \leq \frac{\epsilon}{2n} \pm \tau$, for $\tau = \frac{\epsilon}{2n}$. Thus, Theorem 13 implies
that conjunctions can be learned in a privacy-preserving manner without any communication overhead.
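A sketch of this one-round protocol for monotone conjunctions is given below in Python. Everything named here is illustrative: `exact_sq` stands in for a (private) statistical-query oracle by returning the exact empirical mean, the threshold $\epsilon/(2n)$ is an assumed reading of the rule above, hypotheses are represented as sets of variable indices, and negated literals are omitted for brevity.

```python
def exact_sq(sample, query):
    """Stand-in statistical-query oracle: exact empirical mean (tolerance 0)."""
    return sum(query(x) for x in sample) / len(sample)


def learn_conjunction(positives, n, eps, sq=exact_sq):
    """Entity side: keep variable j when the estimate of Pr_{D+}[x_j = 0]
    is at most eps / (2n). Only positive examples are ever queried."""
    return {j for j in range(n)
            if sq(positives, lambda x, j=j: 1 - x[j]) <= eps / (2 * n)}


def combine(hypotheses):
    """Center side: the minimal conjunction containing every h_i keeps
    exactly the variables that all entities kept."""
    return set.intersection(*hypotheses)


# Two entities whose positives are consistent with the target x_0 AND x_2.
h1 = learn_conjunction([(1, 1, 1, 0)] * 50, n=4, eps=0.1)
h2 = learn_conjunction([(1, 0, 1, 1)] * 50, n=4, eps=0.1)
target_guess = combine([h1, h2])
```

Swapping `exact_sq` for a noisy oracle changes nothing about the messages sent: each entity still transmits a single hypothesis.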

Indeed, in all the algorithms for specific classes given in this paper, except for parity functions, the 
interaction between entities and their data can be simulated with statistical queries. For example, the decision 
list algorithm of Theorem 5 can be implemented by having each entity identify rules to send to the center 
via statistical queries to $D_i$. Thus, in these or any other cases where the information required by the protocol
can be extracted by each entity using statistical queries to its own data, there is no communication overhead 
due to preserving privacy. 

10.2 Differential privacy with respect to the entities 

One could also ask for a stronger privacy guarantee, that each entity be able to plausibly claim to be holding
any other dataset it wishes; that is, to require $e^{-\alpha} \leq \Pr_{S_i}(\sigma)/\Pr_{S'}(\sigma) \leq e^{\alpha}$ for all $S_i$ and all (even
unrelated) $S'$. This in fact corresponds precisely to the local privacy notion of Kasiviswanathan et al. [2008],
where in essence the only privacy-preserving mechanisms possible are via randomized-response.$^9$ They
show that any statistical query algorithm can be implemented in such a setting; however, because each
entity is now viewed as essentially a single datapoint, to achieve any nontrivial accuracy, $k$ must be quite
large.
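Randomized response and the corresponding debiasing step can be sketched in Python as follows. The names `randomized_response` and `estimate_fraction` are hypothetical, and `a` plays the role of the bias parameter in a mechanism that answers truthfully with probability $1/2 + a$.

```python
import random


def randomized_response(truth, a, rng):
    """Report the true bit with probability 1/2 + a, the flipped bit otherwise."""
    return truth if rng.random() < 0.5 + a else (not truth)


def estimate_fraction(responses, a):
    """Debias the observed 'yes' rate q. If a fraction p of entities truly
    hold 'yes', then E[q] = (1/2 - a) + 2*a*p, so p = (q - 1/2 + a) / (2*a)."""
    q = sum(responses) / len(responses)
    return (q - 0.5 + a) / (2 * a)
```

Since the observed rate over $k$ entities has standard deviation at most $1/(2\sqrt{k})$, the debiased estimate has standard deviation at most $1/(4a\sqrt{k})$, which is why $k$ must be large when $a$ is small.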

10.3 Distributional privacy 

If the number of entities is small, but we still want privacy with respect to the entities themselves, then one
type of privacy we can achieve is a notion of distributional privacy. Here we guarantee that each player
$i$ reveals (essentially) no more information about its own sample $S_i$ than is inherent in $D_i$ itself. That is, we
think of $S_i$ as "sensitive" but $D_i$ as "non-sensitive". Specifically, let us say a probabilistic mechanism $A$ for
answering a request $q$ satisfies $(\alpha, \delta)$ distributional privacy if

$$\Pr_{S, S' \sim D_i}\left[\forall v,\ e^{-\alpha} \leq \frac{\Pr(A(S, q) = v)}{\Pr(A(S', q) = v)} \leq e^{\alpha}\right] \geq 1 - \delta.$$

In other words, with high probability, two random samples $S, S'$ from $D_i$ have nearly the same probability
of producing any given answer to request $q$. Blum et al. [2008] introduce a similar privacy notion,$^{10}$ which
they show is strictly stronger than differential privacy, but do not provide efficient algorithms. Here, we
show how distributional privacy can be implemented efficiently.

Notice that in this context, an ideal privacy-preserving mechanism would be for player $i$ to somehow use
its sample to reconstruct $D_i$ perfectly and then draw a "fake" sample from $D_i$ to use in its communication
protocol. However, since reconstructing $D_i$ perfectly is not in general possible, we instead will work via
statistical queries.

Theorem 14. If $\mathcal{H}$ is learnable using $M$ statistical queries of tolerance $\tau$, then $\mathcal{H}$ is learnable preserving
distributional privacy from a sample of size $O(\frac{M^2 \log^3(M/\delta)}{\alpha^2 \tau^2})$.

Proof. We will show that we can achieve distributional privacy using statistical queries by adding additional
Laplace noise beyond that required solely for differential privacy of the form in Section 10.1.

Specifically, for any statistical query $q$, Hoeffding bounds imply that with probability at least $1 - \delta'$,
two random samples of size $N$ will produce answers within $\beta = O(\sqrt{\log(1/\delta')/N})$ of each other (because
each will be within $\beta/2$ of the expectation with probability at least $1 - \delta'/2$). This quantity $\beta$ can now
be viewed as the "global sensitivity" of query $q$ for distributional privacy. In particular, it suffices to add
Laplace noise of width $O(\beta/\alpha')$ in order to achieve privacy parameter $\alpha'$ for this query $q$ because we have
that with probability at least $1 - \delta'$, for two random samples $S, S'$ of size $N$, for any $v$,
$\Pr(A(S, q) = v)/\Pr(A(S', q) = v) \leq e^{\beta/(\beta/\alpha')} = e^{\alpha'}$. Note that this has the property that with probability
at least $1 - \delta'$, the amount of noise added to any given answer is at most $O((\beta/\alpha') \log(1/\delta'))$.

If we have a total of $M$ queries, then it suffices for preserving privacy over the entire sequence to set
$\alpha' = \alpha/M$ and $\delta' = \delta/M$. In order to have each query answered with high probability to within $\pm\tau$,
it suffices to have $\beta + (\beta/\alpha') \log(1/\delta') \leq c\tau$ for some constant $c$, where the additional (low-order) $\beta$
term is just the statistical estimation error without added noise. Solving for $N$, we find that a sample of
size $N = O(\frac{M^2 \log^3(M/\delta)}{\alpha^2 \tau^2})$ is sufficient to maintain distributional privacy while answering each query to
tolerance $\tau$, as desired. □
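The "solving for $N$" step can be spelled out as follows. This is a sketch that keeps only the dominant noise term and writes $C$ for the constant hidden in $\beta = O(\sqrt{\log(1/\delta')/N})$; both simplifications are assumptions made here for readability.

```latex
% Dominant constraint, with \beta = C\sqrt{\log(1/\delta')/N}:
\frac{\beta}{\alpha'}\log\frac{1}{\delta'} \;\le\; c\,\tau
\quad\Longleftrightarrow\quad
\sqrt{\frac{\log(1/\delta')}{N}} \;\le\; \frac{c\,\tau\,\alpha'}{C\,\log(1/\delta')}
\quad\Longleftrightarrow\quad
N \;\ge\; \frac{C^{2}\,\log^{3}(1/\delta')}{c^{2}\,\alpha'^{2}\,\tau^{2}} .
% Substituting \alpha' = \alpha/M and \delta' = \delta/M:
N \;=\; O\!\left(\frac{M^{2}\,\log^{3}(M/\delta)}{\alpha^{2}\,\tau^{2}}\right).
```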

$^9$ For example, if an entity is asked a question such as "do you have an example with $x_i = 1$", then it flips a coin and with
probability $1/2 + \alpha'$ gives the correct answer and with probability $1/2 - \alpha'$ gives the incorrect answer, for some appropriate $\alpha'$.

$^{10}$ In the notion of Blum et al. [2008], $D_i$ is uniform over some domain and sampling is done without replacement.
As in the results of Section 10.1, Theorem 14 implies that if each player can run its portion of a desired
communication protocol while only interacting with its own data via statistical queries, then so long as
$|S_i|$ is sufficiently large, we can implement distributional privacy without any communication penalty by
performing internal statistical queries privately as above. For example, combining Theorem 14 with the
proof of Theorem 13 we have:

Theorem 15. If $\mathcal{H}$ can be properly learned via statistical queries to $D^+$ only, then $\mathcal{H}$ can be learned using
one round and $k$ hypotheses of total communication while preserving distributional privacy.
A Table of results

Class / Category                             Communication                                              Efficient?
Conjunctions over $\{0,1\}^n$                $O(nk)$ bits                                               yes
Parity functions over $\{0,1\}^n$, $k = 2$   $O(n)$ bits                                                yes
Decision lists over $\{0,1\}^n$              $O(nk \log n)$ bits                                        yes
Linear separators in $\mathbb{R}^d$          $O(d \log(1/\epsilon))$ examples*                          yes
  under radially-symmetric $D$               $O(k)$ examples                                            yes
  under $\alpha$-well-spread $D$             $O(k(1 + \alpha/\gamma^2))$ hypotheses                     yes
  under non-concentrated $D$                 $O(k^2 \sqrt{d \log(dk/\epsilon)}/\epsilon^2)$ hypotheses  yes
General Intersection-Closed                  $k$ hypotheses                                             see Note 1 below
Boosting                                     $O(d \log 1/\epsilon)$ examples*                           see Note 2 below
Agnostic learning                            $O(k \log(|\mathcal{H}|) \log(1/\epsilon))$ examples*      see Note 3 below


*: plus low-order additional bits of communication.

Note 1: Efficient if one can compute the smallest consistent hypothesis in $\mathcal{H}$ efficiently, and for any given
$h_1, \ldots, h_k$, can efficiently compute the minimum $h \supseteq h_i$ for all $i$.
Note 2: Efficient if one can efficiently weak-learn with $O(d)$ examples.
Note 3: Efficient if one can efficiently run the robust halving algorithm for $\mathcal{H}$.

B Additional simple cases

B.1 Distribution-based algorithms

An alternative basic approach, in settings where it can be done succinctly, is for each entity $i$ to send to
the center a representation of its (approximate) distribution over labeled data. Then, given the descriptions,
the center can deduce an approximation of the overall distribution over labeled data and search for a
near-optimal hypothesis. This example is especially relevant for the agnostic 1-dimensional case, e.g., a union
of $d$ intervals over $X = [0, 1]$. Each entity first simply sorts the points, and determines $d/\epsilon$ border points
defining regions of probability mass (approximately) $\epsilon/d$. For each segment between two border points,
the entity reports the fraction of positive versus negative examples. It additionally sends the border points
themselves. This communication requires $O(d/\epsilon)$ border points and an additional $O(\log d/\epsilon)$ bits to report
the fractions within each such interval, per entity. Given this information, the center can approximate the
best union of $d$ intervals with error $O(\epsilon)$. Note that the supervised learning baseline algorithm would have a
bound of $O(d/\epsilon^2)$ in terms of the number of points communicated.

Theorem 16. There is an algorithm for agnostically learning a union of $d$ intervals that uses one round and
$O(kd/\epsilon)$ values (each either a datapoint or a $\log(d/\epsilon)$-bit integer), such that the final hypothesis produced
has error $\mathrm{opt}(\mathcal{H}) + \epsilon$.
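The per-entity summary used by this algorithm can be sketched in Python as below. The function name `summarize` and the exact bucketing rule (equal-count slices of the sorted sample, with each slice's largest point as its border) are assumptions for illustration; the text only requires regions of approximately equal probability mass.

```python
def summarize(points, num_segments):
    """points: list of (x, label) pairs with label in {0, 1}.
    Returns the border points and the fraction of positives per segment."""
    pts = sorted(points)
    n = len(pts)
    borders, fractions = [], []
    for s in range(num_segments):
        segment = pts[s * n // num_segments:(s + 1) * n // num_segments]
        if not segment:
            continue  # more segments than points
        borders.append(segment[-1][0])
        fractions.append(sum(1 for _, y in segment if y == 1) / len(segment))
    return borders, fractions
```

With `num_segments` set to roughly $d/\epsilon$, the two returned lists are the entire message an entity sends to the center.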

B.2 Version space algorithms 

Another simple case where one can perform well is when the version space can be compactly described.
The version space of $\mathcal{H}$ given a sample $S_i$ is the set of all $h \in \mathcal{H}$ which are consistent with $S_i$. Denote this
set by $\mathrm{VerSp}(\mathcal{H}, S_i)$.

Generic Version Space Algorithm: Each entity sends $\mathrm{VerSp}(\mathcal{H}, S_i)$ to the center. The center computes
$V = \bigcap_i \mathrm{VerSp}(\mathcal{H}, S_i)$. Note that $V = \mathrm{VerSp}(\mathcal{H}, \bigcup_i S_i)$. The center can send either $V$ or some $h \in V$.
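For a finite class, the generic algorithm and the identity $V = \mathrm{VerSp}(\mathcal{H}, \bigcup_i S_i)$ can be illustrated with a few lines of Python. The threshold class and the two samples below are made-up toy data, and `version_space` is a hypothetical helper name.

```python
def version_space(hypotheses, sample):
    """Names of the hypotheses consistent with every labeled example in `sample`."""
    return {name for name, h in hypotheses.items()
            if all(h(x) == y for x, y in sample)}


# Toy finite class: thresholds h_c(x) = 1 iff x >= c.
thresholds = {c: (lambda x, c=c: int(x >= c)) for c in (0.2, 0.4, 0.6, 0.8)}
S1 = [(0.1, 0), (0.7, 1)]   # entity 1's labeled sample
S2 = [(0.3, 0), (0.9, 1)]   # entity 2's labeled sample

# Center: intersect the version spaces received from the entities.
V = version_space(thresholds, S1) & version_space(thresholds, S2)
```

The intersection coincides with the version space of the pooled sample `S1 + S2`, so the center loses nothing by never seeing the raw examples.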
Example (linear separators in $[0, 1]^2$): Assume that the points have margin $\gamma$. We can cover a convex set
in $[0, 1]^2$ using $1/\gamma^2$ rectangles, whose union completely covers the convex set, and is completely covered
by the convex set extended by $\gamma$. Each entity does this for its positive and negative regions, sending this
(approximate) version space to the center. This gives a one-round algorithm with communication cost of
$O(1/\gamma^2)$ points.

C Linear Separators: Margin lower bound 

Proof. (Theorem 8) Suppose we have two players, each with their own set of examples, such that the 
combined dataset has a linear separator of margin $\gamma$. Suppose furthermore we run the perceptron algorithm
where each player performs updates on their own dataset until consistent (or at least until low-error) and 
then passes the hypothesis on to the other player, with the process continuing until one player receives a 
hypothesis that is already low-error on its own data. How many rounds can this take in the worst case? 

Below is an example showing a problematic case where this can indeed result in $\Omega(1/\gamma^2)$ rounds.

In this example, there are 3 dimensions and the target vector is $(0, 1, 0)$. Player 1 has the positive
examples, with 49% of its data points at location $(1, \gamma, 3\gamma)$ and 49% of its data points at location $(1, \gamma, -\gamma)$.
The remainder of player 1's points are at location $(1, \gamma, \gamma)$. Player 2 has the negative examples. Half of its
data points are at location $(1, -\gamma, -3\gamma)$ and half of its data points are at location $(1, -\gamma, \gamma)$.

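The following demonstrates a bad sequence of events that can occur, with the two players essentially
fighting over the first coordinate:

player      updates using                      producing hypothesis
player 1    $(1, \gamma, \gamma)$, +           $(1, \gamma, \gamma)$
player 2    $(1, -\gamma, -3\gamma)$, -        $(0, 2\gamma, 4\gamma)$
player 2    $(1, -\gamma, \gamma)$, -          $(-1, 3\gamma, 3\gamma)$
player 1    $(1, \gamma, 3\gamma)$, +          $(0, 4\gamma, 6\gamma)$
player 1    $(1, \gamma, -\gamma)$, +          $(1, 5\gamma, 5\gamma)$
player 2    $(1, -\gamma, -3\gamma)$, -        $(0, 6\gamma, 8\gamma)$
player 2    $(1, -\gamma, \gamma)$, -          $(-1, 7\gamma, 7\gamma)$
player 1    $(1, \gamma, 3\gamma)$, +          $(0, 8\gamma, 10\gamma)$
player 1    $(1, \gamma, -\gamma)$, +          $(1, 9\gamma, 9\gamma)$
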

Notice that when the hypothesis looks like $(-1, k\gamma, k\gamma)$ (here $k$ is an integer index, not the number of
players), the dot-product with the example $(1, \gamma, 3\gamma)$ from player 1 is $-1 + 4k\gamma^2$. So long as this is negative,
player 1 will make two updates producing hypothesis $(1, (k+2)\gamma, (k+2)\gamma)$. Then, so long as $4(k+2)\gamma^2 < 1$,
player 2 will make two updates producing hypothesis $(-1, (k+4)\gamma, (k+4)\gamma)$. Thus, this procedure will
continue for $\Omega(1/\gamma^2)$ rounds. □
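The bad sequence can be replayed exactly with a small Python simulation. This sketch makes several choices that are not in the text: it starts from the hypothesis $(1, \gamma, \gamma)$ produced by player 1's first update in the table, treats a dot product of zero as a mistake, visits each player's examples in the order listed, and uses `fractions.Fraction` so that no floating-point rounding affects the comparisons. The names `run_until_consistent` and `count_handoffs` are hypothetical.

```python
from fractions import Fraction


def run_until_consistent(w, examples):
    """Standard Perceptron on one player's data until no example is misclassified."""
    updates = 0
    changed = True
    while changed:
        changed = False
        for x, y in examples:
            if y * sum(a * b for a, b in zip(w, x)) <= 0:
                w = tuple(a + y * b for a, b in zip(w, x))
                updates += 1
                changed = True
    return w, updates


def count_handoffs(g):
    """Number of times the hypothesis changes hands before some player
    receives a hypothesis that is already consistent with its data."""
    player1 = [((1, g, 3 * g), 1), ((1, g, -g), 1), ((1, g, g), 1)]
    player2 = [((1, -g, -3 * g), -1), ((1, -g, g), -1)]
    w = (1, g, g)                  # after player 1's first update in the table
    turn_order = [player2, player1]
    handoffs = 0
    while True:
        w, updates = run_until_consistent(w, turn_order[handoffs % 2])
        if updates == 0:
            return handoffs
        handoffs += 1
```

Each handoff raises the index $k$ in the argument above by 2 while $4k\gamma^2 < 1$, so the count grows on the order of $1/(8\gamma^2)$; halving $\gamma$ roughly quadruples it.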